

# ENGINEERING IN ADVANCED RESEARCH SCIENCE AND TECHNOLOGY

ISSN 2040-7467 Vol.01, Issue.02 May-2025

Pages: 1030-1043

# A NOVEL RESILIENT POWER AND DENSITY OPTIMIZED BUFFER-BASED ROUTER DESIGN USING CLOCK GATING AND RR

### <sup>1</sup>P. SANDHYA RANI, <sup>2</sup>G.V RAMANA

<sup>1</sup>M. Tech, Dept. Of ECE, VSM college of engineering, Ramachandrapuram, A.P <sup>2</sup>Guide, Assistant professor (Ph. D) Dept. Of ECE, VSM college of engineering, Ramachandrapuram, A.P

### **Abstract:**

The design is a network router with one input port, through which packets enter, and five output ports, through which packets are driven out. A packet contains two parts: a header and data. The packet width is 16 bits, and the length of the transferred packet can be between 1 bit and 16 bits. The switch drives each packet to the appropriate port based on the destination address of the packet. Each output port has a unique 2-bit port address. If the destination address of the packet matches the port address, the switch drives the packet to that output port. The length of the data is 12 bits.

In this paper, the Xilinx ISE EDA tool is used for synthesis and simulation. In the proposed design, efficient power-optimised clock-gating logic is designed with a reduced number of states. Because the number of states is reduced, the time needed to produce a response decreases, and the operating frequency improves accordingly.

**Keywords**: Crossbar, Arbiter, FIFO, Round robin algorithm, NoC, Buffer, Clock Gating, Power aware, Dynamic allocation.

**Introduction:** Interconnects have been crucial for the implementation of parallel and high-performance computing technologies because on-chip communications have a significant impact on the overall area, performance, and power consumption of modern systems-on-chip (SoCs). According to Amdahl's law, increasing communication overhead degrades the speedup achieved by parallel computing [1]. Networks-on-chip (NoCs) are the most scalable interconnection paradigm capable of meeting the different performance requirements of heavy workloads [2], including latency via adaptive routing [3], throughput via improved path diversity [4], power dissipation by optimizing the NoC to targeted workloads [5], and flexibility by run-time configuration [6]. On-chip processing elements (PEs) are considered network nodes connected by routers and switches, while data in NoCs are handled as packets. NoCs offer a scalable alternative to massive SoCs, but they come with high resource overheads and greater power consumption [7]. The transaction is divided into four levels by the NoC layering
model: (1) application, (2) transport, (3) network, and (4) physical layers [8]. The fundamental unit of the NoC physical layer is a crossbar. A crossbar switch is a shared communication channel that uses multiple access to facilitate the interchange of physical packets. The primary resource-sharing techniques used by current NoC crossbars are time-division multiple access (TDMA), in which the physical link is time-shared between the interconnected PEs [9], and space-division multiple access (SDMA), in which a dedicated link is established between every pair of interconnected PEs [10]. An NoC router's physical layer additionally contains storage and buffering devices [7]. Code-division multiple access (CDMA) is another medium-sharing method; it uses the code space to provide simultaneous medium access. Each transmit-receive (TX-RX) pair in a CDMA channel is given an individual bipolar spreading code, and the data spread from all transmitters are added together to create an additive communication channel. Since there is no cross-correlation between orthogonal spreading codes in standard CDMA systems, the received sum can be correctly decoded by a correlator decoder at the CDMA receiver. Walsh-Hadamard orthogonal codes are used in classical CDMA systems to allow for medium sharing. CDMA has been suggested as an on-chip interconnect sharing approach for both bus and NoC interconnect topologies [11]. Reduced power consumption, fixed communication delay, and decreased system complexity are just a few benefits of using CDMA for on-chip interconnects [12]. A CDMA switch offers a good compromise between SDMA and TDMA, since it has less wiring complexity than an SDMA crossbar and less arbitration overhead than a TDMA switch. However, the on-chip interconnect literature has primarily examined only the fundamental aspects of the CDMA technology.
Overloaded CDMA is a well-known medium access technique in wireless communications in which the number of users sharing the communication channel is increased by raising the number of usable spreading codes, at the expense of increased multiple-access interference (MAI) [13]. Applying the overloaded CDMA idea can increase the capacity of on-chip interconnects. In previous work, we applied the overloaded CDMA idea to CDMA-based on-chip buses and proposed two methods, MAI-based and difference-based overloaded CDMA interconnects, which boost the bus capacity by 25% and 50%, respectively [14], [15]. In this article, we apply the overloaded CDMA idea to NoCs and propose an original overloaded CDMA interconnect (OCI) crossbar design that improves the CDMA router capacity by 100% at minimal cost.

A router is a device that forwards data packets between computer networks. This creates an overlay internetwork, as a router is connected to two or more data lines from different networks. When a data packet arrives on one of the lines, the router reads the address information in the packet to determine its ultimate destination.
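The classical (non-overloaded) CDMA sharing scheme described in the introduction can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the function names, the 4-chip code length, and the choice of which Walsh-Hadamard rows to use are hypothetical and are not taken from the paper's design.

```python
def walsh_hadamard(n):
    """Return the n x n Walsh-Hadamard matrix with +1/-1 entries (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester construction: H_2k = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def spread(bit, code):
    """Map bit 0/1 to -1/+1 and multiply it by every chip of the spreading code."""
    symbol = 1 if bit else -1
    return [symbol * chip for chip in code]


def channel_sum(chip_streams):
    """Additive channel: add the chips of all transmitters position by position."""
    return [sum(chips) for chips in zip(*chip_streams)]


def despread(received, code):
    """Correlate the received sum with one code; the sign of the result gives the bit."""
    correlation = sum(r * c for r, c in zip(received, code))
    return 1 if correlation > 0 else 0


codes = walsh_hadamard(4)
tx_bits = [1, 0, 1]  # three TX-RX pairs share the channel (hypothetical data)
# Using rows 1..3 as spreading codes is an arbitrary choice for this sketch.
streams = [spread(bit, codes[i + 1]) for i, bit in enumerate(tx_bits)]
received = channel_sum(streams)
rx_bits = [despread(received, codes[i + 1]) for i in range(len(tx_bits))]
print(received, rx_bits)
```

Because the rows of the matrix are mutually orthogonal, each correlator sees only the contribution of its own transmitter, which is why every receiver recovers its bit from the shared sum.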

**LITERATURE SURVEY:** Channamallikarjuna Mattihalli et al. [1] present a networking solution that applies VLSI architecture techniques to router design in order to provide intelligent control over the network. They attempt to provide a multipurpose networking router by means of Verilog code, thus we can maintain the same switching speed with more security, as we embed the packet storage
buffer on chip and generate the code as a self-contained VLSI-based router. The approach results in increased per-packet switching speed for both current-trend protocols, which we believe would considerably enhance networking systems.

Feng Liang et al. [2] proposed a novel test pattern generator (TPG) for built-in self-test. Their method generates multiple single-input-change (MSIC) vectors in a pattern, i.e., each vector applied to a scan chain is an SIC vector. A reconfigurable Johnson counter and a scalable SIC counter are developed to generate a class of minimum-transition sequences. The proposed TPG is flexible to both the test-per-clock and the test-per-scan schemes. Results show that the produced MSIC sequences have the favorable features of uniform distribution and low input transition density.

James Aweya [3] examines the new, powerful architectures that routers need in order to play their demanding role. The work identifies important trends in router design and outlines some design issues facing the next generation of routers. It also observes that high-throughput IP routers are achievable if the critical tasks are identified and special-purpose modules are properly tailored to perform them.

M. Sowmya et al. [4] attempt to give a one-time networking solution by merging the VLSI field with the networking field. Because the router is now the key player in the networking domain, the focus remains on the router in order to gain good control over the network. The paper is based on hardware coding, which has a great impact on latency because the hardware itself is designed according to the need.

Crossbar switches that use CDMA as their medium access mechanism benefit from predictable transaction latency and minimal arbitration overhead. Nikolic et al. [16] proposed a scalable CDMA-based peripheral bus in order to reduce the number of point-to-point (PTP) buses and parallel transfer lines while avoiding the overhead caused by TDMA arbiters. Because fewer lines are used to add and transmit the data from the peripherals, this method lowers the number of pins at the interface connecting multiple peripherals to multiple PEs. Since peripherals typically run at lower frequencies than master PEs, the increase in transaction latency caused by data spreading is acceptable. In the CT-Bus, CDMA and TDMA are merged, and data are multiplexed over both the time and code domains [12].

### **Existing Method:**



Fig:1 Existing Block diagram

The five-port router is built from three blocks: an 8-bit register, a router controller, and an output block. The router controller is designed as a finite state machine (FSM), and the output block consists of four FIFOs combined together. The FIFOs store data packets, and data are read from the FIFOs when they are to be sent. This router design has four 16-bit outputs and one 12-bit data port. The router can operate with a single master device and with one or more slave devices. If a single slave device is used, the RE (read enable) pin may be fixed to logic low if the slave permits it. Some slaves require a falling edge (HIGH $\rightarrow$ LOW transition) of the slave select to initiate an action, for example starting a conversion on that transition.

# First In First Out (FIFO):

A FIFO is a widely used design block that provides synchronization and a handshaking mechanism between modules.

**Depth of FIFO:** The number of slots or rows in FIFO is called the depth of the FIFO.

**Width of FIFO:** The number of bits that can be stored in each slot or row is called the width of the FIFO.

There are two types of FIFOs:

1. Synchronous FIFO
2. Asynchronous FIFO

### **Synchronous FIFO**

In a synchronous FIFO, the read and write operations are driven by the same clock. Synchronous FIFOs are usually run at a high clock frequency to support high-speed systems.



Fig2: FIFO representation

### **Synchronous FIFO Operation**

# Signals:

- wr\_en: write enable
- wr\_data: write data
- full: FIFO is full
- empty: FIFO is empty
- rd\_en: read enable
- rd\_data: read data
- w\_ptr: write pointer
- r\_ptr: read pointer

# FIFO write operation

The FIFO stores wr\_data on every positive edge of the clock for which wr\_en is asserted, until it is full. The write pointer is incremented on every write to the FIFO memory.

# FIFO read operation

Data are read from the FIFO on every positive edge of the clock for which rd\_en is asserted, until it is empty. The read pointer is incremented on every read from the FIFO memory.
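The write and read behaviour described above can be captured in a small cycle-level software model. This is a minimal sketch rather than the paper's Verilog implementation: the class name `SyncFifo`, the `posedge` method, and the internal occupancy counter used to derive `full` and `empty` are assumptions made for illustration, while the signal names follow the list given earlier.

```python
class SyncFifo:
    """Cycle-level model of a synchronous FIFO (one clock for read and write)."""

    def __init__(self, depth, width):
        self.depth = depth               # number of slots
        self.mask = (1 << width) - 1     # keeps stored words within the FIFO width
        self.mem = [0] * depth
        self.w_ptr = 0                   # write pointer
        self.r_ptr = 0                   # read pointer
        self.count = 0                   # occupancy counter (an assumption of this sketch)
        self.rd_data = 0

    @property
    def full(self):
        return self.count == self.depth

    @property
    def empty(self):
        return self.count == 0

    def posedge(self, wr_en=False, wr_data=0, rd_en=False):
        """Model one rising clock edge; flags are sampled before the edge."""
        do_write = wr_en and not self.full
        do_read = rd_en and not self.empty
        if do_read:
            self.rd_data = self.mem[self.r_ptr]
            self.r_ptr = (self.r_ptr + 1) % self.depth
        if do_write:
            self.mem[self.w_ptr] = wr_data & self.mask
            self.w_ptr = (self.w_ptr + 1) % self.depth
        self.count += int(do_write) - int(do_read)


fifo = SyncFifo(depth=4, width=16)
for word in (0x1A2B, 0x0003, 0xFFFF):
    fifo.posedge(wr_en=True, wr_data=word)
fifo.posedge(rd_en=True)
print(hex(fifo.rd_data), fifo.full, fifo.empty)
```

A write attempted while `full` is set, or a read attempted while `empty` is set, is simply ignored in this model, so the pointers never overtake each other.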

## **ARBITER:**

An **arbiter** grants access to the data bus whenever more than one requester wants to use it. An arbiter is like a traffic police constable who grants drivers access to the road according to the traffic rules.

DOI: 30.0405/ijearst.2025.2505

https://ijearst.co.in/

Consider the case of a system-on-chip (SoC). The term SoC denotes a complete system comprising different units on a single chip. In an SoC, data transfer between the different units takes place over a single bus.

### **Round-Robin Arbiter**

A round-robin arbiter sets the priority of the requesters so that every requester gets the resource equally. In a **round-robin arbiter**, the priority of each requester changes with every allocation of the resource, so the requesters are served in a circular fashion. Once a requester, say the one in state 01, has been served, it must pass through all other states before it gets the resource again. Initially, when no request has been served yet, the first requester is given the highest priority, the next requester a lower priority, and so on until every requester has been assigned a priority. The requester with the highest priority is then allotted the resource; its priority is changed to the lowest, and the priority of every other requester is incremented by one. The next allocation again goes to the requester with the highest priority, and this process continues until no requests are left.

The round-robin arbiter is important because it prevents starvation and provides statistical fairness in a system. Starvation occurs when a request is repeatedly denied access to the shared resource even though other requests are being granted access. The round-robin arbiter solves this problem by granting access to each request in a circular fashion, ensuring that no request is consistently denied access to the resource. Statistical fairness also matters in systems with multiple requests for a shared resource. Without it, a small number of requests may be granted access to the resource more often than the others, which can lead to performance degradation or even system failure. The round-robin arbiter ensures that each request has an equal chance of being granted access to the resource, which promotes statistical fairness.
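The rotating-priority rule described above can be sketched in a few lines of software. This is an illustrative model, not the paper's hardware arbiter: the class name `RoundRobinArbiter`, the `grant` method, and the `next_first` field are hypothetical names chosen for this example.

```python
class RoundRobinArbiter:
    """Behavioural model of a round-robin arbiter for n requesters."""

    def __init__(self, n):
        self.n = n
        self.next_first = 0  # requester that currently holds the highest priority

    def grant(self, requests):
        """Take a list of n booleans; return the granted index, or None if idle."""
        for offset in range(self.n):
            idx = (self.next_first + offset) % self.n
            if requests[idx]:
                # The winner drops to the lowest priority for the next round.
                self.next_first = (idx + 1) % self.n
                return idx
        return None


arbiter = RoundRobinArbiter(4)
grants = [arbiter.grant([True, True, True, True]) for _ in range(5)]
print(grants)  # [0, 1, 2, 3, 0]
```

With all four requesters active, the grant moves around the ring one position at a time, so no requester can be served twice before every other active requester has been served once.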

Crossbar switch: In electronics and telecommunications, a crossbar switch (cross-point switch, matrix switch) is a collection of switches arranged in a matrix configuration. A crossbar switch has multiple input and output lines that form a crossed pattern of interconnecting lines, between which a connection may be established by closing a switch located at each intersection; these switches are the elements of the matrix. Originally, a crossbar switch consisted literally of crossing metal bars that provided the input and output paths. Later implementations achieved the same switching topology in solid-state electronics. The crossbar switch is one of the principal telephone exchange architectures, together with the rotary switch, the memory switch [2], and the crossover switch.

This crossbar design can be, and has been, implemented in various technologies, including specialized VLSI implementations with domino logic.



Figure 3. (a) Crossbar switch states. (b) Typical design of crossbar switch.

A crossbar is a module built from a combination of multiplexers and demultiplexers that connects an input port to an output port. There is no feedback in this design. At any given time, the crossbar can establish only one link. The binary inputs are Cin, Ein, Nin, Sin, and Win, and the binary outputs are Cout, Eout, Nout, Sout, and Wout. Both the inputs and the outputs are 16 bits wide. The select input of this design is reduced to two lines, giving four binary combinations.

| Input port    | Select | Output Port |
|---------------|--------|-------------|
| SC            | 00     | CNI         |
| SC            | 01     | CWI         |
| SC            | 10     | CEI         |
| SC            | 11     | CSI         |
| SN            | 00     | NWI         |
| SN            | 01     | NEI         |
| SN            | 10     | NSI         |
| SN            | 11     | NCI         |
| SS            | 00     | SCI         |
| SS            | 01     | SNI         |
| SS            | 10     | SWI         |
| SS            | 11     | SEI         |
| SE            | 00     | ESI         |
| SE            | 01     | ECI         |
| SE            | 10     | ENI         |
| SE            | 11     | EWI         |
| SW            | 00     | WEI         |
| SW            | 01     | WSI         |
| SW            | 10     | WCI         |
| SW            | 11     | WNI         |
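The select table above follows a regular pattern that can be expressed compactly in software. The sketch below rests on two assumptions that the paper does not state explicitly: each output label is read as source letter, destination letter, then the suffix `I`, and the destinations for a given source are taken in the cyclic order C, N, W, E, S starting after the source. The names `PORT_ORDER` and `output_label` are hypothetical.

```python
PORT_ORDER = ["C", "N", "W", "E", "S"]  # cyclic order inferred from the table rows


def output_label(source, select):
    """Return the table label for a source port letter and a 2-bit select value."""
    if source not in PORT_ORDER or not 0 <= select <= 3:
        raise ValueError("unknown port or select value")
    start = PORT_ORDER.index(source)
    # Select 0 picks the port after the source in the cyclic order, select 1 the next, ...
    destination = PORT_ORDER[(start + 1 + select) % len(PORT_ORDER)]
    return source + destination + "I"


print(output_label("C", 0b00), output_label("S", 0b10), output_label("W", 0b11))
```

Since a source is never routed back to itself, the four remaining ports are exactly what two select lines can address, which agrees with the four binary combinations mentioned in the text.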

After all these blocks have been designed individually, they are combined using structural modelling and simulated. The power consumption of this router design is obtained using the XPower Analyzer. Paper [15] mentions that buffers help reduce latency, so the design is modified with an additional buffer at each output port of the router.

### **BUFFER:**

Digital buffers and tri-state buffers can provide current amplification in a digital circuit to drive output loads. The buffer is adaptive; that is, it continuously learns, at run time, the optimal delay to be applied to the audio flow. Once the optimal delay has been learned, the delay buffer applies it to the audio flow, expanding or shrinking the audio samples as necessary when the number of audio samples in the buffer is too low or too high.



Fig4: Buffer



Fig5: Existing Block of FIFO Organisation

# **INPUT BUFFER:**

The Input buffer is also commonly known as the input area or input block.

When referring to computer memory, the input buffer is a location that holds all incoming information before it continues to the CPU for processing.

The term input buffer can also be used to describe various other hardware or software buffers that store information before it is processed.

### **MEMORY BLOCK:**

Random-access memory (RAM) is a form of computer data storage. Today, it takes the form of integrated circuits that allow stored data to be accessed in any order (that is, at random). "Random" refers to the idea that any piece of data can be returned in a constant time, regardless of its physical location and whether it is related to the previous piece of data.

The word "RAM" is often associated with volatile types of memory (such as DRAM memory modules), where the information is lost after the power is switched off. Many other types of memory are RAM as well, including most types of ROM and a type of flash memory called NOR flash.

Scan design has been the backbone of design for testability (DFT) in industry for about three decades because scan-based design can successfully obtain controllability and observability for flip-flops. Serial scan design has dominated the test architecture because it is convenient to build. However, serial scan design causes unnecessary switching activity during testing, which induces unnecessarily large power dissipation.

### RING COUNTER:

A **ring counter** is a type of counter composed of a circular shift register. The output of the last shift register is fed to the input of the first register.

There are two types of ring counters: the straight ring counter and the twisted ring counter (Johnson counter). A *straight ring counter*, or *Overbeck counter*, connects the output of the last shift register to the first shift register input and circulates a single one (or zero) bit around the ring. For example, in a 4-register one-hot counter with initial register values of 1000, the repeating pattern is 1000, 0100, 0010, 0001, 1000, and so on. Note that one of the registers must be pre-loaded with a 1 (or 0) for the counter to operate properly. The waveforms for all four stages look the same, except for the one-clock delay from one stage to the next. See figure below.



Load 1000 into 4-stage ring counter and shift

Fig6: Ring counter Wave
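The one-hot sequence of the straight ring counter can be reproduced with a short software model. This is a sketch for illustration only; the function name `ring_counter` and its list-of-strings return format are assumptions of this example.

```python
def ring_counter(initial, steps):
    """Simulate a straight ring counter and return its state after each clock.

    initial is a list of 0/1 register values, first stage first.
    """
    state = list(initial)
    history = ["".join(str(bit) for bit in state)]
    for _ in range(steps):
        # The output of the last stage feeds the input of the first stage.
        state = [state[-1]] + state[:-1]
        history.append("".join(str(bit) for bit in state))
    return history


print(ring_counter([1, 0, 0, 0], 4))
# ['1000', '0100', '0010', '0001', '1000']
```

The state returns to its initial value after as many clocks as there are stages, so a 4-stage counter selects each of four locations once per cycle.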

### **PROPOSED METHOD:**



Fig7: Proposed Block diagram

### **Modified Buffer architecture:**



Fig 8 Block Diagram For Proposed Buffer

### **GATED DRIVER TREE:**



Fig9: Gated Driver Tree

The gated driver tree is derived from the same clock-gating signals as the blocks that it drives. Thus, in a quad-tree clock distribution network, the "gate" signal of the gate driver at the level (CKE) should be asserted when the active DET flip-flop

## **MODIFIED RING COUNTER:**



Fig10 Modified Ring Counter

### **DET (Double-Edge-Triggered) Flip-Flops:**

Double-edge-triggered (DET) flip-flops, which sample the input signal on both edges of the clock, are used to reduce the operating frequency by half. In this paper, we propose to use DET flip-flops instead of traditional DFFs in the ring counter to halve the operating clock frequency. DET flip-flops are becoming a popular technique for low-power designs since they effectively enable a halving of the clock frequency.
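The claim that a DET flip-flop sustains the same sampling rate at half the clock frequency can be checked with a simple edge count. The sketch below is an abstract timing model, not a circuit description: the helper names `count_captures` and `square_wave`, and the 17-step waveform length, are assumptions chosen for illustration.

```python
def count_captures(clock, double_edge):
    """Count how many times a flip-flop samples its input for a clock waveform.

    clock is a list of 0/1 levels, one per time step.
    """
    captures = 0
    for prev, cur in zip(clock, clock[1:]):
        rising = prev == 0 and cur == 1
        falling = prev == 1 and cur == 0
        # A single-edge flip-flop reacts to rising edges only; a DET flip-flop to both.
        if rising or (double_edge and falling):
            captures += 1
    return captures


def square_wave(half_period, length):
    """Square wave that starts low and toggles every half_period time steps."""
    return [(t // half_period) % 2 for t in range(length)]


fast_clock = square_wave(half_period=1, length=17)  # full-rate clock
slow_clock = square_wave(half_period=2, length=17)  # same duration, half the frequency

print(count_captures(fast_clock, double_edge=False))  # single-edge at full rate
print(count_captures(slow_clock, double_edge=True))   # DET at half rate
```

Both calls report the same number of captures over the same time window, while the half-rate clock toggles only half as often, which is the source of the clock-power saving.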

### **C ELEMENT:**

The Muller **C-element**, or Muller C-gate, is a commonly used asynchronous logic component originally designed by David E. Muller. It applies logical operations on the inputs and has hysteresis. The output of the C-element reflects the inputs when the states of all inputs match. The output then remains in this state until the inputs all transition to the other state. This model can be extended to the asymmetric C-element, where some inputs only affect the operation in one of the transitions (positive or negative). The figure shows the gate-level and transistor-level implementations and the symbol of the C-element.

Here is the truth table for a 2-input C-gate. $Y_{n-1}$ denotes a "no change" condition.



Fig11: C- Element

The C-element stores its previous state with two cross-coupled inverters, similar to an SRAM cell. One of the inverters is weaker than the rest of the circuit, so it can be overpowered by the pull-up and pull-down networks. If both inputs are 0, the pull-up network changes the latch's state, and the C-element outputs a 0. If both inputs are 1, the pull-down network changes the latch's state, making the C-element output a 1. Otherwise, the input of the latch is not connected to either $V_{dd}$ or ground, so the weak inverter (drawn smaller in the diagram) dominates and the latch outputs its previous state. The Muller **C-element** was first used in the arithmetic logic unit (ALU) of the ILLIAC II supercomputer, proposed in 1958 and operational in 1962.
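The hold behaviour of the 2-input C-element can be expressed as a small behavioural model. This is a sketch of the logical function only, not of the transistor-level circuit; the class name `CElement` and the `update` method are hypothetical names for this example.

```python
class CElement:
    """Behavioural model of a 2-input Muller C-element."""

    def __init__(self, initial=0):
        self.y = initial  # stored output, corresponds to Y(n-1) in the truth table

    def update(self, a, b):
        """Apply inputs a and b (0 or 1) and return the output."""
        if a == b:
            self.y = a    # both inputs agree: the output follows them
        return self.y     # inputs differ: the output keeps its previous value


c = CElement()
outputs = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(outputs)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs have risen and falls only after both have fallen, which is the hysteresis mentioned above.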

### **RESULTS:**



Fig a: Proposed Simulation results

**Table a: Comparison Results** 

| Parameter           | Existing method | Proposed method |
|---------------------|-----------------|-----------------|
| Power (mW)          | 156             | 114             |
| Area (Gate count)   | 161             | 74              |
| Time (ns)           | 4.905           | 4.512           |
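The relative improvement implied by the comparison table can be computed directly from its entries. The snippet below is a small calculation aid that uses only the table values; the helper name `percent_reduction` is an assumption of this example.

```python
# (existing, proposed) values copied from the comparison table
results = {
    "Power (mW)": (156, 114),
    "Area (gate count)": (161, 74),
    "Time (ns)": (4.905, 4.512),
}


def percent_reduction(existing, proposed):
    """Reduction of the proposed value relative to the existing one, in percent."""
    return 100.0 * (existing - proposed) / existing


for name, (existing, proposed) in results.items():
    print(f"{name}: {percent_reduction(existing, proposed):.1f}% lower")
```

On these figures the proposed design shows roughly a 26.9% reduction in power, a 54.0% reduction in gate count, and an 8.0% reduction in time.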

#### **CONCLUSION:**

This paper presents a novel approach that improves NoC throughput by packet prioritization. The objective is to increase the NoC throughput by using congestion-aware information, which has already been applied to on-chip communication to improve NoC routing. The proposed method increases the efficiency of data transmission by reducing power and area compared with the existing model. This is achieved by adding modified buffers at the output ports. The functionality of the proposed router structure is implemented in Verilog HDL, and the architecture is shown to consume fewer resources in terms of the number of LUTs, slices, and IO buffers. The Xilinx ISE EDA tool is used for synthesis and simulation. Data sent through the router reach the destination with less power consumption.

### REFERENCES

- [1] Channamallikarjuna Mattihalli, Suprith Ron, and Naveen Kolla, "VLSI Based Robust Router Architecture," Third International Conference on Intelligent Systems Modelling and Simulation, 2012.
- [2] Feng Liang, Luwen Zhang, Shaochong Lei, Guohe Zhang, Kaile Gao, and Bin Liang, "Test Patterns of Multiple SIC Vectors: Theory and Application in BIST Schemes," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 4, April 2013.
- [3] James Aweya, "IP Router Architectures: An Overview."
- [4] M. SOWMYA C. SHIREESHA G. SWETHA PRAKASH J. PATIL, "VLSI Based Robust Router Architecture."
- [5] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable High Speed IP Routing Lookup," Proc. ACM SIGCOMM'97, Cannes, France, Sept. 1997.
- [6] V. Srinivasan and G. Varghese, "Faster IP Lookups using Controlled Prefix Expansion," Proc. ACM SIGMETRICS, May 1998.
- [7] S. Nilsson and G. Karlsson, "Fast Address Look-Up for Internet Routers," Proc. Of IEEE Broadband Communications'98, April 1998.
- [8] E. Filippi, V. Innocenti, and V. Vercellone, "Address Lookup Solutions for Gigabit Switch/Router," Proc. Globecom'98, Sydney, Australia, Nov. 1998.
- [9] M. Thottethodi, A. R. Lebeck, and S. S. Mukherjee, "BLAM: a high performance routing algorithm for virtual cut-through networks," in Proceedings of the International Parallel and Distributed Processing Symposium.
- [10] L. S. Peh and W. J. Dally, "A delay model and speculative architecture for pipelined routers," in Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA).
- [11] R. H. Bell, C. Y. Kang, L. John, and E. E. Swartzlander, "CDMA as a multiprocessor interconnect strategy," in Proc. Conf. Rec. 35th Asilomar Conf. Signals, Syst. Comput., vol. 2. Nov. 2001, pp. 1246–1250.
- [12] B. C. C. Lai, P. Schaumont, and I. Verbauwhede, "CT-bus: A heterogeneous CDMA/TDMA bus for future SOC," in Proc. Conf. Rec. 35th Asilomar Conf. Signals, Syst. Comput., vol. 2. Nov. 2004, pp. 1868–1872.
- [13] S. A. Hosseini, O. Javidbakht, P. Pad, and F. Marvasti, "A review on synchronous CDMA systems: Optimum overloaded codes, channel capacity, and power control," EURASIP J. Wireless Commun. Netw., vol. 1, pp. 1–22, Dec. 2011.
- [14] K. E. Ahmed and M. M. Farag, "Overloaded CDMA bus topology for MPSoC interconnect," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2014, pp. 1–7.
- [15] K. E. Ahmed and M. M. Farag, "Enhanced overloaded CDMA interconnect (OCI) bus architecture for on-chip communication," in Proc. IEEE 23rd Annu. Symp. High-Perform. Interconnects (HOTI), Aug. 2015, pp. 78–87.
- [16] T. Nikolic, G. Djordjevic, and M. Stojcev, "Simultaneous data transfers over peripheral bus using CDMA technique," in Proc. 26th Int. Conf. Microelectron. (MIEL), May 2008, pp. 437–440.
- [17] X. Wang, T. Ahonen, and J. Nurmi, "Applying CDMA technique to network-on-chip," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 10, pp. 1091–1100, Oct. 2007.
- [18] D. Kim, M. Kim, and G. E. Sobelman, "CDMA-based network-on-chip architecture," in Proc. IEEE Asia–Pacific Conf. Circuits Syst., vol. 1. Dec. 2004, pp. 137–140.
- [19] D. Kim, M. Kim, and G. E. Sobelman, "Design of a high-performance scalable CDMA router for on-chip switched networks," in Proc. Int. SoC Des. Conf, Nov. 2005, pp. 32–35.
- [20] Venkataraman, N. L., Rajagopal Kumar, and P. Mohamed Shakeel. "Ant lion optimized bufferless routing in the design of low power application specific network on chip." Circuits, Systems, and Signal Processing 39.2 (2020): 961-976.
- [21] S. Madhavan and H. P. V, "Design and Verification of 1X5 ROUTER," 2022 IEEE 2nd Mysore Sub Section International Conference (MysuruCon), Mysuru, India, 2022, pp. 1-6, doi: 10.1109/MysuruCon55714.2022.9972633.
- [22] J. A. Williams, N. W. Bergmann and X. Xie, "FIFO communication models in operating systems for reconfigurable computing," 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05), Napa, CA, USA, 2005, pp. 277-278, doi: 10.1109/FCCM.2005.35.